Supplement for Where to Add Actions in Human-in-the-Loop Reinforcement Learning
Abstract
UCRL: This strongly optimistic approach is based on UCRL2 (Jaksch, Ortner, and Auer 2010), which defines a confidence set over each transition/reward distribution and, when planning, uses the value-maximizing distribution within that confidence set. Specifically, in a finite-horizon setting it allows the L1 norm of the transition distribution to deviate from the MLE by at most $\sqrt{14 \log(SA\ell\tau/\delta) / \max(N, 1)}$, where ℓ is the number of episodes, N is the number of transition samples, and δ is a user-specified confidence parameter. UCRL's bound incorporates global uncertainty to ensure that the true MDP is within the confidence set with high probability (Auer and Ortner 2007). We use this bound to quantify the uncertainty over our outcome distribution, setting δ = 0.05 as in Osband et al. (2013). The advantage of UCRL2's L1 constraint is that it is easy to compute the most optimistic distribution that obeys it, as explained in Strehl and Littman (2004). Since we are in a finite-horizon setting, we compute the most optimistic distribution for each (s, t) pair.

MBIE: Model-based Interval Estimation (MBIE) (Strehl and Littman 2004; 2008) is a very similar idea to UCRL, but it simply bounds the local L1 divergence at each state with probability 1 − δ. The original MBIE algorithm (Strehl and Littman 2004) used a bound on their transition dynamics…
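As a concrete illustration of these two steps, the sketch below (not from the supplement itself) computes the UCRL2-style L1 confidence width stated above and then the most optimistic outcome distribution inside that L1 ball, following the mass-shifting construction described by Strehl and Littman (2004). The function names, the reading of τ as a horizon-related term, and the toy numbers are illustrative assumptions.

```python
import numpy as np


def l1_confidence_width(S, A, num_episodes, tau, N, delta=0.05):
    """UCRL2-style L1 confidence width from the excerpt above.

    S, A         : number of states and actions
    num_episodes : episodes observed so far (the ell in the bound)
    tau          : horizon-related term inside the log (assumed meaning)
    N            : number of transition samples for this (s, t) pair
    delta        : user-specified confidence parameter
    """
    return np.sqrt(14.0 * np.log(S * A * num_episodes * tau / delta) / max(N, 1))


def optimistic_distribution(p_hat, values, eps):
    """Most optimistic outcome distribution within L1 distance eps of the MLE.

    Shifts probability mass onto the highest-value successor and removes the
    same amount from the lowest-value successors first.
    """
    p = np.asarray(p_hat, dtype=float).copy()
    order = np.argsort(values)            # successor indices, ascending value
    best = order[-1]                      # highest-value successor
    add = min(eps / 2.0, 1.0 - p[best])   # mass the budget and simplex allow
    p[best] += add
    remaining = add
    for s in order:                       # take mass from the worst states first
        if remaining <= 0.0:
            break
        if s == best:
            continue
        take = min(p[s], remaining)
        p[s] -= take
        remaining -= take
    return p


# Example: widen the MLE toward the successor with the highest value estimate.
p_hat = [0.5, 0.3, 0.2]                   # MLE outcome distribution
values = [1.0, 5.0, 2.0]                  # value estimates of the successors
eps = l1_confidence_width(S=3, A=2, num_episodes=10, tau=5, N=20)
print(optimistic_distribution(p_hat, values, eps))
```

Because mass is only ever added to the single highest-value successor and removed from the lowest-value ones, the returned distribution maximizes expected value subject to the L1 constraint, which is exactly what the optimistic planning step needs at each (s, t) pair.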
Similar Resources
Where to Add Actions in Human-in-the-Loop Reinforcement Learning
In order for reinforcement learning systems to learn quickly in vast action spaces such as the space of all possible pieces of text or the space of all images, leveraging human intuition and creativity is key. However, a human-designed action space is likely to be initially imperfect and limited; furthermore, humans may improve at creating useful actions with practice or new information. Theref...
Reinforcement Learning Based PID Control of Wind Energy Conversion Systems
In this paper an adaptive PID controller for Wind Energy Conversion Systems (WECS) has been developed. The adaptation technique applied to this controller is based on Reinforcement Learning (RL) theory. Nonlinear characteristics of wind variations as plant input, wind turbine structure and generator operational behavior demand for high quality adaptive controller to ensure both robust stability an...
Hierarchical Functional Concepts for Knowledge Transfer among Reinforcement Learning Agents
This article introduces the notions of functional space and concept as a way of knowledge representation and abstraction for Reinforcement Learning agents. These definitions are used as a tool of knowledge transfer among agents. The agents are assumed to be heterogeneous; they have different state spaces but share the same dynamics, reward, and action space. In other words, the agents are assumed t...
Utilizing Generalized Learning Automata for Finding Optimal Policies in MMDPs
Multi agent Markov decision processes (MMDPs), as the generalization of Markov decision processes to the multi-agent case, have long been used for modeling multi-agent systems and serve as a suitable framework for multi-agent Reinforcement Learning. In this paper, a generalized learning automata based algorithm for finding optimal policies in MMDP is proposed. In the proposed algorithm, MMDP ...
A Q-learning Based Continuous Tuning of Fuzzy Wall Tracking
A simple, easy-to-implement algorithm is proposed to address the wall tracking task of an autonomous robot. The robot should navigate in unknown environments, find the nearest wall, and track it solely based on locally sensed data. The proposed method benefits from coupling fuzzy logic and Q-learning to meet the requirements of autonomous navigation. Fuzzy if-then rules provide a reliable decision maki...
Publication date: 2016